{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![LOGO](../../../img/MODIN_ver2_hrz.png)\n",
    "\n",
    "<center><h2>Scale your pandas workflows by changing one line of code</h2>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 2: Speed improvements\n",
    "\n",
    "**GOAL**: Learn about common functionality that Modin speeds up by using all of your machine's cores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for Exercise: `read_csv` speedups\n",
    "\n",
    "The most commonly used data ingestion method used in pandas is CSV files (link to pandas survey). This concept is designed to give an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports other file formats for parallel and distributed reads, which can be found in the documentation. We will import both Modin and pandas so that the speedups are evident.\n",
    "\n",
    "**Note: Rerunning the `read_csv` cells many times may result in degraded performance, depending on the memory of the machine**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import modin.pandas as pd\n",
    "import pandas\n",
    "import time\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "def printmd(string):\n",
    "    display(Markdown(string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset: 2015 NYC taxi trip data\n",
    "\n",
    "\n",
    "We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes\n",
    "\n",
    "**Size: ~1.8GB**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modin execution engine setting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import modin.config as cfg\n",
    "cfg.Engine.put(\"dask\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `pandas.read_csv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "pandas_df = pandas.read_csv(path, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to read with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Expect pandas to take >3 minutes on EC2, longer locally\n",
    "\n",
    "This is a good time to chat with your neighbor\n",
    "Dicussion topics\n",
    "- Do you work with a large amount of data daily?\n",
    "- How big is your data?\n",
    "- What’s the common use case of your data?\n",
    "- Do you use any big data analytics tools?\n",
    "- Do you use any interactive analytics tool?\n",
    "- What’s are some drawbacks of your current interative analytic tools today?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `modin.pandas.read_csv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "modin_df = pd.read_csv(path, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to read with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `read_csv`!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Are they equal?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Reduces\n",
    "\n",
    "In pandas, a reduce would be something along the lines of a `sum` or `count`. It computes some summary statistics about the rows or columns. We will be using `count`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "pandas_count = pandas_df.count()\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "\n",
    "print(\"Time to count with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "modin_count = modin_df.count()\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to count with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `count`!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Are they equal?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Map operations\n",
    "\n",
    "In pandas, map operations are operations that do a single pass over the data and do not change its shape. Operations like `isnull` and `applymap` are included in this. We will be using `isnull`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "pandas_isnull = pandas_df.isnull()\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "\n",
    "print(\"Time to isnull with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "modin_isnull = modin_df.isnull()\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to isnull with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `isnull`!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Are they equal?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_isnull"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_isnull"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Apply over a single column\n",
    "\n",
    "Sometimes we want to compute some summary statistics on a single column from our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "rounded_trip_distance_pandas = pandas_df[\"trip_distance\"].apply(round)\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to groupby with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "rounded_trip_distance_modin = modin_df[\"trip_distance\"].apply(round)\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to add a column with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas at `apply` on one column!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Are they equal?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rounded_trip_distance_pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rounded_trip_distance_modin"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concept for exercise: Add a column\n",
    "\n",
    "It is common to need to add a new column to an existing dataframe, here we show that this is significantly faster in Modin due to metadata management and an efficient zero copy implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "pandas_df[\"rounded_trip_distance\"] = rounded_trip_distance_pandas\n",
    "\n",
    "end = time.time()\n",
    "pandas_duration = end - start\n",
    "print(\"Time to groupby with pandas: {} seconds\".format(round(pandas_duration, 3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "\n",
    "modin_df[\"rounded_trip_distance\"] = rounded_trip_distance_modin\n",
    "\n",
    "end = time.time()\n",
    "modin_duration = end - start\n",
    "print(\"Time to add a column with Modin: {} seconds\".format(round(modin_duration, 3)))\n",
    "\n",
    "printmd(\"### Modin is {}x faster than pandas add a column!\".format(round(pandas_duration / modin_duration, 2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Are they equal?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pandas_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "modin_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Please move on to [Exercise 3](./exercise_3.ipynb) when you are ready**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
